Linear modeling

Douwe Molenaar

Systems Biology Lab

2024-11-01

Modeling data

Learning from data

  • Unsupervised learning: what is the “structure” of the data?
    • Are there groups/clusters, how many?
    • What is the diversity/similarity between objects?
    • Which are the outliers?
    • \(\cdots\)
  • Supervised learning:
    • Fitting the data using a modeling framework
    • Use the fit to predict some variables when others were measured
    • Find informative variables among many potential predictors
    • Inference: testing hypotheses about properties of the variables. In particular, is there a genuine dependence of the response on the predictor variables?
    • \(\cdots\)

Unsupervised learning: exploring data

Ability to graphically display complex data is crucial

  • Sample grouping/clustering algorithms like K-means or Hierarchical clustering
  • Dimension reduction techniques like PCA, tSNE, UMAP
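These techniques are not the focus of this course, but base R already provides them. A minimal sketch on the built-in iris data set (kmeans() and prcomp() are standard base-R functions):

```r
# Unsupervised exploration of the built-in iris measurements
set.seed(1)
d  <- iris[, 1:4]               # numeric columns only
km <- kmeans(d, centers = 3)    # K-means clustering into 3 groups
table(km$cluster)               # how many samples per cluster?
pc <- prcomp(d, scale. = TRUE)  # PCA: dimension reduction
summary(pc)                     # variance explained per component
```

Plotting the first two principal components, colored by cluster, is a common way to display such results graphically.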

This course: mainly supervised learning

  • Predictor variables
    • mRNA level of each of 20000 genes (numbers)
    • concentration of each of 100 blood metabolites (numbers)
    • concentration of each of 500 bacterial species on the skin (numbers)
    • gender (a label)
    • smoker (a label)
  • Response variables
    • growth rate (a positive number)
    • health status / disease class (a label)
    • probability of 1 year survival (a number between 0 and 1)

Terminology and representation in R


Types of variables

Type of variable     Official name   R data type
------------------   -------------   -----------
Continuous number    numeric         numeric
Integer              numeric         integer
Label                categorical     factor
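The mapping from values to R data types can be checked directly in an R session:

```r
class(1.5)          # "numeric": a continuous number
class(2L)           # "integer": the L suffix makes an integer literal
class(factor("M"))  # "factor": a categorical label
is.numeric(2L)      # TRUE: integers count as numeric too
```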


Types of models

Predictor variables    Response variable   Type of model
--------------------   -----------------   -------------
categorical, numeric   categorical         classifier
categorical, numeric   numeric             regression

Supervised learning

Premise

One or more of the predictor variables must carry information about the response variable

  • These predictor variables and the response variable must be statistically dependent

Goals

  • To fit several mathematical functions \(y = f(x_1, x_2, \ldots, x_n)\), \(y = g(x_1, x_2, \ldots, x_n)\), \(\ldots\) that allow us to
    • predict the value or class label of the response variable \(y\) of a sample using one or more predictor variables, \(x_1, x_2, \ldots, x_n\), measured on that sample
    • find the type of dependence between the response and the predictor variables
    • find out which of the predictor variables contain information about the response

Many frameworks for statistical modeling

The functions \(f, g, \ldots\) can be from the framework of

  • Linear modeling: a function of \(x_1, x_2, \ldots\) with linear parameters
  • Nonlinear modeling: a function of \(x_1, x_2, \ldots\) with non-linear parameters
  • Neural network: a complex nonlinear function of \(x_1, x_2, \ldots\) with weights as parameters
  • Classification and Regression Trees: a decision tree with decision boundaries on \(x_1, x_2, \ldots\) as parameters
  • \(\cdots\)

Example: predicting income

Goal: Predicting income from Seniority and Years of Education

Some of the questions you can ask

  • Which of the variables \(x_1, x_2, \ldots\) contain information about \(y\)?
    • Are there objective ways to test dependence of variables?
  • How complex should the model function be?
    • Higher complexity: more variables or more terms with these variables, more parameters to tune
    • Usually: the more complex, the better the fit, but is it useful?

Is the model to the right a better model?

Take a step back

Proposition

The best model is the one that fits the training data best

  • Take a few minutes to ponder over this proposition
  • Discuss opinions with your neighbours

Things to consider

  • What is the goal of your model? What do you want to do with it?
  • Can you think of another way than “best fit to training data” when comparing models?
  • How would you assess your definition of “better model”?

Statistical models and variance

The simplest model

What is the best model you can use in the absence of predictor variables?

\[y = \beta_0\]

The least-squares fit of this model minimizes the sum of squared differences:

\[ S(\beta_0) = \sum_{i=1}^n (y_i - \beta_0)^2 \]

It is obtained when \(\beta_0\) equals the average (sample mean) of measurements \(y_i\):

\[ \beta_0 = \frac{1}{n} \sum_{i=1}^n y_i = \overline{y} \]

\(y=\overline{y}\) will be called the Null Model or M0
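In R, the Null Model is an intercept-only fit, and its estimate indeed equals the sample mean. A small check on simulated data:

```r
set.seed(1)
y  <- rnorm(20, mean = 5)              # simulated measurements
m0 <- lm(y ~ 1)                        # the Null Model: intercept only
all.equal(unname(coef(m0)), mean(y))   # fitted intercept equals the sample mean
```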

Residual sum of squares of the Null Model

The Residual Sum of Squares (RSS) of the Null Model is:

\[ \text{RSS} = \sum_{i=1}^n (y_i - \overline{y})^2 \]

The RSS of the Null Model is also called the Total Sum of Squares (TSS) of the data set \(y_i\)

The Mean Square Error (MSE) of the Null Model is:

\[ \text{MSE} \equiv \frac{\text{RSS}}{n} = \frac{1}{n} \sum \left(y_i - \overline{y} \right)^2 \]

Do you recognize the right hand side?

It is almost the same as the sample variance \(s^2(y_i) = \frac{1}{n-1} \sum \left(y_i - \overline{y} \right)^2\)
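The relation between the MSE and the sample variance can be checked numerically on simulated data:

```r
set.seed(1)
y   <- rnorm(30)
n   <- length(y)
TSS <- sum((y - mean(y))^2)              # RSS of the Null Model
MSE <- TSS / n
# var() divides by n - 1, the MSE by n:
all.equal(MSE, var(y) * (n - 1) / n)     # TRUE
```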

Good models “explain” part of the variance


Model M0: \(y=\beta_0\)

\(\text{RSS} \, (\equiv \text{TSS}) = 10^{4}\)


Model M1: \(y=\beta_0 + \beta_1 x\)

\(\text{RSS} = 800\)

Variance explained or \(R^2\)

The Total Sum of Squares (TSS) of a response data set \(y_i\) was defined as

\[ \text{TSS} = \sum_{i=1}^n \left( y_i- \overline{y} \right)^2 \]

The Residual Sum of Squares (RSS) of a model \(y=f(x_1,x_2, \ldots)\) is

\[ \text{RSS} = \sum_{i=1}^n \left( y_i- f(x_{i1},x_{i2}, \ldots) \right)^2 \]


The variance explained is defined as \(R^2 = \frac{\text{TSS}-\text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}}\)


What are the maximal and minimal possible values of \(R^2\)? (Think carefully!)
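The definition of \(R^2\) matches what summary.lm() reports. A check on simulated data:

```r
set.seed(2)
x <- 1:20
y <- 2 + 3 * x + rnorm(20, sd = 4)
m <- lm(y ~ x)
TSS <- sum((y - mean(y))^2)                      # total sum of squares
RSS <- sum(residuals(m)^2)                       # residual sum of squares
all.equal(1 - RSS / TSS, summary(m)$r.squared)   # TRUE
```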

Linear modeling: the principles

Subjects treated

  • What are linear models?
  • Linear models can fit quite complex data
  • How they can be used for categorical and numerical predictors, or combinations of these
  • In R, all linear modeling is handled by the function lm()
  • The meaning of parameter confidence intervals, and how to determine them

Definition of a linear model

In a linear model the parameters \(\beta_i\) of the model appear linearly in the model equation: each parameter occurs with power 1 in the model equation.


Examples of linear model equations

\[ \begin{align*}y &= \beta_0 + \beta_1 x \\ y &= \beta_1 x + \beta_2 x^2 \\ y &= \beta_1 x + \beta_2 \frac{1}{x} \\ y &= \beta_0 + \beta_1 e^{x} \end{align*} \]


Counter examples (non-linear models)

\[ \begin{align*} y &= \frac{\beta_0 x}{\beta_1 + x} \\ y &= \beta_1 x + \frac{1}{\beta_2 x} \\ y &= \beta_0 + \beta_2 e^{\beta_3 x} \end{align*} \]
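Models that are linear in the parameters can be fitted with lm() even when the predictor enters nonlinearly. A sketch for the last linear example, with simulated data and arbitrarily chosen true values \(\beta_0 = 1\), \(\beta_1 = 3\):

```r
set.seed(7)
x <- seq(0, 2, length.out = 25)
y <- 1 + 3 * exp(x) + rnorm(25, sd = 0.5)  # true parameters: 1 and 3
m <- lm(y ~ exp(x))                        # linear in the parameters
coef(m)                                    # estimates close to 1 and 3
```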

Definition of a linear model

In a linear model the parameters \(\beta_i\) of the model appear linearly in the model equation: each parameter occurs with power 1 in the model equation.

Surprisingly:

  • a predictor \(x\) can appear in non-linear fashion, like \(x^2\), \(\frac{1}{x}\), \(e^x\), etc.
  • the same predictor variable can appear multiple times
    • Example: \(\beta_1 x + \beta_2 x^2 + \beta_3 x^3\)
    • The different terms \(x\), \(x^2\), \(x^3\) can be considered as different predictors, as long as they are linearly independent!

Comparing the data to predictions of the model equation

  • Measure \(y\) at \(i = 1 \ldots n\) different values of \(x\)
  • This yields a set of \(n\) 2-tuples \({(y_i, x_i)}\)
  • Propose a model \(y = f(x, \beta_0, \beta_1) = \beta_0 + \beta_1 x\)
  • Comparing \(y_i\) to the predictions \(y_i'=f(x_i, \beta_0, \beta_1)\), yields residuals or errors \(\epsilon_i = y_i - y_i'\)
\(i\)        \(x_i\)      \(y_i\)      \(f(x_i, \vec{\beta})\)   \(\epsilon_i\)
----------   ----------   ----------   -----------------------   --------------
1            0            0.1          0.2                       -0.1
2            1            2.4          2.2                       0.2
3            2            5.5          6.0                       -0.5
\(\vdots\)   \(\vdots\)   \(\vdots\)   \(\vdots\)                \(\vdots\)
\(n\)        100          1.1          0.0                       1.1

Data and predictions of a model with multiple predictor variables

  • Measure \(y\) at \(i = 1 \ldots n\) different combinations of predictor values \(\vec{x}_{i}\)
  • Yields a set of \(n\) tuples \((y_i, x_{i1}, x_{i2}, \ldots, x_{im})\), abbreviated as \((y_i,\vec{x}_i)\)
  • When comparing \(y_i\) to the predictions \(y_i'=f(\vec{x}_{i},\vec{\beta})\), errors \(\epsilon_i = y_i - y_i'\) remain
\(i\)        \(x_{i1}\)   \(x_{i2}\)   \(\ldots\)   \(x_{im}\)   \(y_i\)      \(f(\vec{x}_{i},\vec{\beta})\)   \(\epsilon_i\)
----------   ----------   ----------   ----------   ----------   ----------   ------------------------------   --------------
1            0            1            \(\ldots\)   0            0.1          0.11                             -0.01
2            1            1            \(\ldots\)   0            2.4          2.38                             0.02
3            2            1            \(\ldots\)   0            5.8          6.0                              -0.2
\(\vdots\)   \(\vdots\)   \(\vdots\)                \(\vdots\)   \(\vdots\)   \(\vdots\)                       \(\vdots\)
\(n\)        100          3            \(\ldots\)   5            -0.1         0.0                              -0.1

Assumptions about the residuals, model notation

For linear models we assume that residuals are drawn from a normal distribution with mean \(0\) and constant standard deviation \(\sigma\).

Notation:

\[ y = f(\vec{x}, \vec{\beta}) + \epsilon \qquad \epsilon \thicksim \text{Norm}(0,\sigma) \]

Examples

\[ \begin{align*} y &= \beta_0 + \beta_1 x + \epsilon &\qquad \epsilon \thicksim \text{Norm}(0,\sigma) \\ y &= \beta_1 x_1 + \beta_2 x_1 x_2 + \epsilon &\qquad \epsilon \thicksim \text{Norm}(0,\sigma) \end{align*} \]

Same model, other notation

\[ \begin{align*} y &\thicksim \text{Norm}(\beta_0 + \beta_1 x, \sigma) \\ y &\thicksim \text{Norm}(\beta_1 x_1 + \beta_2 x_1 x_2, \sigma) \end{align*} \]

Interpretation: \(y\) itself is drawn from a normal distribution whose mean depends on \(x, x_1, x_2\)

A general way to write linear models

Having a linear model equation

\[ y = f(\vec{x},\vec{\beta}) + \epsilon = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_m x_m + \epsilon \qquad \epsilon \thicksim \text{Norm}(0,\sigma) \]

Having \(n\) measurement tuples \((y_i, x_{ij})\) with \(i=1 \ldots n\) and \(j=1 \ldots m\), the predicted responses \(y'_i = f(\vec{x}_i, \vec{\beta})\) can be written as the matrix equation

\[ \begin{bmatrix} y'_1 \\ y'_2 \\ \vdots \\ y'_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1m} \\ 1 & x_{21} & x_{22} & \ldots & x_{2m} \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{nm} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{bmatrix} \]

abbreviated as

\[ \vec{y}' = \boldsymbol{X} \vec{\beta} \]

Example

For model equation

\[ y = f(x,\vec{\beta}) + \epsilon = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \qquad \epsilon \thicksim \text{Norm}(0,\sigma) \]

this becomes

\[ \begin{bmatrix} y'_1 \\ y'_2 \\ \vdots \\ y'_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} \]
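In R, model.matrix() builds exactly this matrix \(\boldsymbol{X}\) from a formula:

```r
x <- 1:5
X <- model.matrix(~ x + I(x^2))  # design matrix for y = b0 + b1*x + b2*x^2
X                                # columns: (Intercept), x, I(x^2)
```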

The general way to write linear models

Notice that in the expression \(\boldsymbol{X}\vec{\beta}\):

  1. the first column of \(\boldsymbol{X}\) is a column of 1’s to accommodate the offset \(\beta_0\)
  2. the subsequent columns could represent:
    • different variables \(x_1\), \(x_2\), … (equation \(y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots\))
    • different powers of the same variable \(x\), \(x^2\), … (equation \(y=\beta_0 + \beta_1 x + \beta_2 x^2 + \ldots\))
    • products of different variables \(x_1 x_2\), \(x_1 x_2 x_3\), … (equation \(y=\beta_0 + \beta_1 x_1 x_2 + \beta_2 x_1 x_2 x_3 + \ldots\))
    • etc.

Anything is allowed for the predictor terms:

  • As long as columns of \(\boldsymbol{X}\) are linearly independent.
  • I.e. no column of \(\boldsymbol{X}\) is allowed to be a linear combination of one or more other columns.

Comparing predictions of the model and measured values

  • In general it is impossible to choose values \(\beta_i\) of \(\vec{\beta}\) that make the predicted values \(\vec{y}'=\boldsymbol{X}\vec{\beta}\) equal to the measured values \(\vec{y}\)
    • Because we have more (linear) equations than parameters
    • Hence, the vector of measured responses \(\vec{y}\) is in general not in the column space \(\mathcal{C}(\boldsymbol{X})\) of \(\boldsymbol{X}\)

Example:

\[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} \] has in general no solution for \(\vec{\beta}\)

The deviation between measurements and predictions is caught in an error vector \(\vec{\epsilon}\)

\[ \vec{\epsilon} = \vec{y} - \vec{y}' = \vec{y} - \boldsymbol{X} \vec{\beta} \]

Back to what we mean by “fitting a function”

Fitting a function to the data means tuning the values of the parameters \(\beta_0, \beta_1, \ldots\) such that the sum of squared errors, or residual sum of squares, is minimized:

\[ \text{RSS} (\vec{\beta}) = \sum_{i=1}^n \epsilon_i^2 = \|\vec{\epsilon}\|^2 = \|\vec{y} - \boldsymbol{X} \vec{\beta}\|^2 \]

  1. Equivalently: the norm (length) of vector \(\vec{\epsilon}\), \(\|\vec{\epsilon}\|\) needs to be minimized
  2. The vector \(\vec{y}' = \vec{y} - \vec{\epsilon} = \boldsymbol{X} \vec{\beta}\) must be in the column space of \(\boldsymbol{X}\) to be able to obtain a solution for \(\vec{\beta}\)

The solution is relatively simple

The perpendicular projection of \(\vec{y}\) onto the column space \(\mathcal{C}(\boldsymbol{X})\) of \(\boldsymbol{X}\) yields the least-squares solution \(\vec{y}'\)

Perpendicular projection

So, \(\vec{\epsilon}\) must be perpendicular to the column space \(\mathcal{C}(\boldsymbol{X})\) of \(\boldsymbol{X}\), or

\[ \boldsymbol{X}^T\vec{\epsilon} = \vec{0} \]

Since \(\vec{\epsilon} = \vec{y} - \boldsymbol{X}\vec{\beta}\), we get

\[\boldsymbol{X}^T \left( \vec{y} - \boldsymbol{X}\vec{\beta} \right) = \vec{0}\] This is the equation to be solved for \(\vec{\beta}\) to obtain the least-squares fit, or \[ \boldsymbol{X}^T \boldsymbol{X} \vec{\beta} = \boldsymbol{X}^T \vec{y} \]

This equation can be solved for \(\vec{\beta}\) by Gaussian elimination
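The normal equations can be solved directly in R, and the result agrees with lm(). A sketch on simulated data:

```r
set.seed(3)
x <- runif(15)
y <- 1 + 2 * x + rnorm(15, sd = 0.1)
X <- model.matrix(~ x)                 # design matrix with intercept column
beta <- solve(t(X) %*% X, t(X) %*% y)  # solve X^T X beta = X^T y
all.equal(as.vector(beta), unname(coef(lm(y ~ x))))  # TRUE
```

In practice lm() uses a numerically more stable QR decomposition, but for well-conditioned problems the answers coincide.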

Solving linear models in R

All linear models can be fitted with lm()


Our model equation:

\[y = \beta_0 + \beta_1 x + \beta_2 x^2\]

Expressed in an R-formula:

model <- lm(y~x+I(x^2), data=d)

Data and fitted model

Information about the model fit

Fitted parameter values (coefficients)

Confidence intervals for the coefficients

summary(model)

Call:
lm(formula = y ~ x + I(x^2), data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0840 -3.1905 -0.2577  1.7967  7.9306 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.89377    2.37222   3.749  0.00243 ** 
x            7.52765    0.73380  10.258 1.34e-07 ***
I(x^2)       0.59503    0.04719  12.609 1.15e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.567 on 13 degrees of freedom
Multiple R-squared:  0.9982,    Adjusted R-squared:  0.998 
F-statistic:  3697 on 2 and 13 DF,  p-value: < 2.2e-16
confint(model)
                2.5 %     97.5 %
(Intercept) 3.7689034 14.0186459
x           5.9423560  9.1129343
I(x^2)      0.4930864  0.6969831

Interpreting confidence intervals for coefficients

The meaning of a 95% confidence interval for the estimates \(\hat{\beta}\)

  • If you would repeat the experiment and fit each time, you would get multiple, different confidence intervals
  • 95% of these intervals will contain the true value of \(\beta\)
  • For a single experiment we say that the chance that its 95% confidence interval contains the true value \(\beta\) equals 0.95
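This interpretation can be checked by simulation: generate many data sets from a model with a known slope and count how often the interval covers it. A sketch (the “true” parameter values are chosen arbitrarily):

```r
set.seed(4)
beta1 <- 2                          # the known "true" slope
covered <- replicate(1000, {
  x  <- 1:20
  y  <- 1 + beta1 * x + rnorm(20, sd = 3)
  ci <- confint(lm(y ~ x))["x", ]   # 95% CI for the slope
  ci[1] < beta1 && beta1 < ci[2]
})
mean(covered)                       # close to 0.95
```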

We can also use lm() when the predictor is categorical!

Lengths of adult males and females


\(x\)        \(y\) (cm)
----------   ----------
M            173
F            177
M            182
\(\vdots\)   \(\vdots\)
F            165

Model definition and results

Our model equation is:

\[y = \beta_0 + \beta_1 x\]

model <- lm(y~x, d)
coefficients(model)
(Intercept)          xM 
  165.31667    15.04167 

Interpretation of the coefficients:

  • (Intercept): the average length of females (F)
  • xM: average length of males minus average length of females

Graphical interpretation of the model

Under the hood a dummy variable \(x'\) with values 0 (for F) or 1 (for M) is defined

The line is a fit of the model \(y = \beta_0 + \beta_1 x'\), i.e. to the data with the dummy variable

  • Its intercept \(\hat{\beta_0}\) equals the average height of females
  • Its slope \(\hat{\beta_1}\), i.e. the increase over a unit interval, equals the difference between the average height of males and the average height of females

A different point of view of the same

Model equation: \(y = \beta_0 + \beta_1 x'\)

  • For females (\(x' = 0\)) we effectively fit the model \(y = \beta_0\)
  • For males (\(x' = 1\)) we effectively fit the model \(y = \beta_0 + \beta_1\)
  • The least squares estimate \(\hat{\beta_0}\) will be the average of the measurements \(y_i\) for the females
  • The least squares estimate \(\hat{\beta_1}\) will be the difference of the average \(y_i\) for males minus that for females
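This can be verified numerically on simulated heights (the group means and spread below are assumed values for illustration):

```r
set.seed(5)
d <- data.frame(x = factor(rep(c("F", "M"), each = 12)),
                y = c(rnorm(12, mean = 165, sd = 7),
                      rnorm(12, mean = 180, sd = 7)))
m <- lm(y ~ x, data = d)
means <- tapply(d$y, d$x, mean)                                    # group averages
all.equal(unname(coef(m)["(Intercept)"]), unname(means["F"]))      # TRUE
all.equal(unname(coef(m)["xM"]), unname(means["M"] - means["F"]))  # TRUE
```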

Statistical inference

Three equivalent questions

In the example of body lengths of males and females:

  • Is the slope \(\beta_1\) significantly different from 0?
  • Is the difference in body lengths significant?
  • Does the variable gender carry information about the variable height?

Is the difference in body lengths significant?

summary(model)

Call:
lm(formula = y ~ x, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.858  -4.923  -1.258   5.367  11.642 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  165.317      1.967  84.050  < 2e-16 ***
xM            15.042      2.782   5.408 1.98e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.814 on 22 degrees of freedom
Multiple R-squared:  0.5707,    Adjusted R-squared:  0.5511 
F-statistic: 29.24 on 1 and 22 DF,  p-value: 1.975e-05

Or, using confidence intervals

Is the difference in body lengths significant?

confint(model)
                 2.5 %    97.5 %
(Intercept) 161.237585 169.39575
xM            9.272974  20.81036

The value 0 is outside the 95% confidence interval of xM, so we would discard the hypothesis that \(\beta_1=0\) at a rejection level \(\alpha=0.05\)

When discrete variables have more than 2 levels

Dummy encoding of a categorical variable with

  • 2 levels: \(x' = 0\), \(x' = 1\)

  • 3 levels: \(x' = [\begin{smallmatrix}0 \\ 0 \end{smallmatrix}]\), \(x' = [\begin{smallmatrix}0 \\ 1 \end{smallmatrix}]\), \(x' = [\begin{smallmatrix}1 \\ 0 \end{smallmatrix}]\)

  • 4 levels: \(x' = \Bigl[\begin{smallmatrix}0 \\ 0 \\ 0 \end{smallmatrix}\Bigr]\), \(x' = \Bigl[\begin{smallmatrix}1 \\ 0 \\ 0 \end{smallmatrix}\Bigr]\), \(x' = \Bigl[\begin{smallmatrix}0 \\ 1 \\ 0 \end{smallmatrix}\Bigr]\), \(x' = \Bigl[\begin{smallmatrix}0 \\ 0 \\ 1 \end{smallmatrix}\Bigr]\)

etc.: with every additional level a new dimension is added.

In this dummy encoding scheme at most one of the dimensions can have the value 1.
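model.matrix() shows this encoding for a three-level factor:

```r
x <- factor(c("A", "B", "C", "B"))
X <- model.matrix(~ x)   # dummy columns xB and xC, plus the intercept
X
```

The reference level "A" corresponds to both dummy columns being 0.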

When discrete variables have more than 2 levels

Example of a model equation having a term with such a dummy variable with three levels:

\[y = \beta_0 + \vec{\beta}\, \vec{x}'\]

in which the row vector \(\vec{\beta} = [\beta_1 \, \beta_2]\) contains two parameters, one of which is “selected” depending on the value of the dummy \(\vec{x}'\).

Example:

\[[\beta_{1}\, \beta_{2}]\Bigl[\begin{smallmatrix}0 \\ 1 \end{smallmatrix}\Bigr] = \beta_{2}\]

The dummy variable can assume three values: \([\begin{smallmatrix}0 \\ 0 \end{smallmatrix}]\), \([\begin{smallmatrix}1 \\ 0 \end{smallmatrix}]\), and \([\begin{smallmatrix}0 \\ 1 \end{smallmatrix}]\)

Equivalent, but more elaborate way of writing

The equivalent of the model \(y = \beta_0 + \vec{\beta} \vec{x}'\) is the set of equations:

\[y = \begin{cases} \beta_0 & \text{if } x =\text{level 1} \text{, or } [\begin{smallmatrix}0 \\ 0 \end{smallmatrix}] \\ \beta_0 + \beta_{1} & \text{if } x =\text{level 2} \text{, or } [\begin{smallmatrix}1 \\ 0 \end{smallmatrix}] \\ \beta_0 + \beta_{2} & \text{if } x =\text{level 3} \text{, or } [\begin{smallmatrix}0 \\ 1 \end{smallmatrix}] \\ \end{cases}\]

Graphical interpretation with 3 levels



\(x\)        \(y\)
----------   ----------
A            98
B            138
C            172
\(\vdots\)   \(\vdots\)
B            146

The fit is a plane

The corresponding code in R

No worries: the lm() function takes care of dummy encoding

model <- lm(y~x, data=d)
coef(model)
(Intercept)          xB          xC 
   66.95253    53.72300    71.10385 


Interpretation

  • (Intercept): the average \(y\)-value for class “A” samples
  • xB: average value of class “B” minus average value of class “A” samples
  • xC: average value of class “C” minus average value of class “A” samples

Class “A” is apparently used as a reference
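Indeed, by default R uses the first factor level (alphabetically) as the reference; it can be changed with relevel(). A sketch on simulated data for three classes (group means arbitrary):

```r
set.seed(6)
d <- data.frame(x = factor(rep(c("A", "B", "C"), each = 10)),
                y = c(rnorm(10, 100, 15), rnorm(10, 140, 15), rnorm(10, 170, 15)))
m <- lm(y ~ x, data = d)
means <- tapply(d$y, d$x, mean)
all.equal(unname(coef(m)["xB"]), unname(means["B"] - means["A"]))  # TRUE
# change the reference class from "A" to "B":
d$x2 <- relevel(d$x, ref = "B")
coef(lm(y ~ x2, data = d))  # the intercept is now the class "B" average
```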

Appendix

Literature

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Springer Texts in Statistics. New York: Springer. https://www.statlearning.com/.